Speeding up k-means by approximating Euclidean distances via block vectors
Authors
Abstract
This paper introduces a new method to approximate Euclidean distances between points using block vectors in combination with the Hölder inequality. By defining lower bounds based on the proposed approximation, clustering algorithms can be accelerated considerably without loss of quality. In extensive experiments, we show a considerable reduction in computational time compared to standard methods and the recently proposed Yinyang k-means. Additionally, we show that the memory consumption of the presented clustering algorithm does not depend on the number of clusters, which makes the approach suitable for large-scale problems.
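The core idea can be sketched as follows: split each vector into contiguous blocks, precompute the per-block norms, and bound the inner product blockwise via the Hölder inequality (here in its p = q = 2 form, i.e. Cauchy-Schwarz). This yields a cheap lower bound on the squared Euclidean distance that a k-means implementation can use to skip exact distance computations. The function below is an illustrative sketch under these assumptions, not the authors' implementation; the name `block_lower_bound` and the block layout are ours.

```python
import numpy as np

def block_lower_bound(x, y, n_blocks=4):
    """Lower bound on the squared Euclidean distance ||x - y||^2.

    Both vectors are split into the same contiguous blocks. Per block,
    Cauchy-Schwarz (Hölder with p = q = 2) gives
        <x_b, y_b> <= ||x_b|| * ||y_b||,
    and summing over blocks upper-bounds the full inner product, hence
        ||x - y||^2 >= ||x||^2 + ||y||^2 - 2 * sum_b ||x_b|| * ||y_b||.
    """
    xb = np.array_split(np.asarray(x, float), n_blocks)
    yb = np.array_split(np.asarray(y, float), n_blocks)
    # Blockwise upper bound on the inner product <x, y>.
    inner_ub = sum(np.linalg.norm(a) * np.linalg.norm(b)
                   for a, b in zip(xb, yb))
    return float(np.dot(x, x) + np.dot(y, y) - 2.0 * inner_ub)

# Usage: if the lower bound to a centroid already exceeds the best
# distance found so far, the exact distance need not be computed.
rng = np.random.default_rng(0)
x, y = rng.normal(size=64), rng.normal(size=64)
assert block_lower_bound(x, y) <= np.sum((x - y) ** 2)
```

Note that in this pruning scheme the per-block norms of the centroids can be cached and reused across all points, which is where the savings come from; more blocks make the bound tighter at the cost of more precomputed norms.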
Similar papers
Using the Johnson-Lindenstrauss lemma in linear and integer programming
The Johnson-Lindenstrauss lemma allows dimension reduction on real vectors with low distortion on their pairwise Euclidean distances. This result is often used in algorithms such as k-means or k nearest neighbours since they only use Euclidean distances, and has sometimes been used in optimization algorithms involving the minimization of Euclidean distances. In this paper we introduce a first a...
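The dimension reduction the snippet refers to can be sketched with a scaled Gaussian random projection, one standard way to realize the Johnson-Lindenstrauss guarantee. This is an illustrative sketch, not code from the cited paper; the helper name `jl_project` and the parameter choices are ours.

```python
import numpy as np

def jl_project(X, k, seed=0):
    """Project the rows of X from d to k dimensions.

    Uses a d x k matrix with i.i.d. N(0, 1/k) entries; for k large
    enough (on the order of log(n) / eps^2 for n points), pairwise
    Euclidean distances are preserved up to a (1 +- eps) factor with
    high probability.
    """
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

# Usage: distances in the projected space approximate the originals.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1000))
Y = jl_project(X, k=400)
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Y[0] - Y[1])
```

Because algorithms such as k-means only consume pairwise Euclidean distances, they can be run on the projected data directly, trading a small controlled distortion for much cheaper distance evaluations.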
Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...
Generalizing k-means for an arbitrary distance matrix
The original k-means clustering method works only if the exact vectors representing the data points are known. Therefore calculating the distances from the centroids needs vector operations, since the average of abstract data points is undefined. Existing algorithms can be extended for those cases when the sole input is the distance matrix, and the exact representing vectors are unknown. This e...
Approximating the Distributions of Singular Quadratic Expressions and their Ratios
Noncentral indefinite quadratic expressions in possibly non-singular normal vectors are represented in terms of the difference of two positive definite quadratic forms and an independently distributed linear combination of standard normal random variables. This result also applies to quadratic forms in singular normal vectors for which no general representation is currently available. The ...
Probabilistic Multidimensional Scaling Using a City-Block Metric.
Using a probabilistic model, exact and approximate probability density functions (PDFs) for city-block distances and distance ratios are developed. The model assumes that stimuli can be represented by random vectors having multivariate normal distributions. Comparisons with the more common Euclidean PDFs are presented. The potential ability of the proposed model to correctly detect Euclidean an...